Abstract

Deep convolutional networks have achieved great success
in object recognition in still images. However, for action
recognition in videos, the improvement brought by deep
convolutional networks is not so evident. We argue that two
reasons could explain this result. First, the current network
architectures (e.g., Two-stream ConvNets [12]) are relatively
shallow compared with the very deep models in the image
domain (e.g., VGGNet [13], GoogLeNet [15]), and therefore
their modeling capacity is constrained by their depth.
Second, and probably more importantly, the training datasets
for action recognition are extremely small compared with the
ImageNet dataset, and thus it is easy to over-fit on the
training data.

To address these issues, this report presents very deep
two-stream ConvNets for action recognition, obtained by
adapting recent very deep architectures to the video domain.
However, this extension is not easy, as action recognition
datasets are quite small. We design several good practices
for training very deep two-stream ConvNets, namely (i)
pre-training for both spatial and temporal nets, (ii) smaller
learning rates, (iii) more data augmentation techniques, and
(iv) a high dropout ratio. Meanwhile, we extend the Caffe
toolbox into a multi-GPU implementation with high
computational efficiency and low memory consumption. We
verify the performance of very deep two-stream ConvNets on
the UCF101 dataset, where they achieve a recognition
accuracy of 91.4%.

1. Introduction 

Human action recognition has become an important
problem in computer vision and has received a lot of research
interest in this community [12, 16, 19]. The problem of
action recognition is challenging due to large intra-class
variations, low video resolution, the high dimensionality of
video data, and so on.

The past several years have witnessed great progress on
action recognition from short clips [8, 9, 12, 16, 17, 18, 19].
These research works can be roughly categorized into two
types. The first type of algorithm focuses on hand-crafted
local features and the Bag of Visual Words (BoVW)
representation. The most successful example is to extract
improved trajectory features [16] and employ the Fisher
vector representation [11]. The second type of algorithm
utilizes deep convolutional networks (ConvNets) to learn
video representations from raw data (e.g., RGB images or
optical flow fields) and trains the recognition system in an
end-to-end manner. The most competitive deep model is the
two-stream ConvNets [12].

However, unlike in image classification [7], deep ConvNets
have not yielded significant improvement over these
traditional methods. We argue that there are two possible
reasons for this phenomenon. First, the concept of an action
is more complex than that of an object, and it is related to
other high-level vision concepts, such as interacting objects,
scene context, and human pose. Intuitively, a more
complicated problem needs a model of higher complexity.
However, the current two-stream ConvNets are relatively
shallow (5 convolutional layers and 3 fully-connected layers)
compared with the successful models in image classification
[13, 15]. Second, the datasets for action recognition are
extremely small compared with the ImageNet dataset [1]. For
example, the UCF101 dataset [14] contains only 13,320 clips,
whereas deep ConvNets require a huge number of training
samples to tune the network weights.

To address these issues, this report presents very deep
two-stream ConvNets for action recognition. Very deep
two-stream ConvNets have high modeling capacity and are
capable of handling the large complexity of action classes.
However, because of the second problem above, training very
deep models on such a small dataset is very challenging
owing to over-fitting. We propose several good practices to
make the training of very deep two-stream ConvNets stable
and to reduce the effect of over-fitting. By carefully training
our proposed very deep ConvNets on the action dataset, we
are able to achieve state-of-the-art performance on the
UCF101 dataset. Meanwhile, we extend the Caffe toolbox [4]
into a multi-GPU implementation with high efficiency and
low memory consumption.

The remainder of this report is organized as follows. In
Section 2, we introduce our proposed very deep two-stream
ConvNets in detail, including the network architectures,
training details, and testing strategy. We report our
experimental results on the UCF101 dataset in Section 3.
Finally, we conclude the report in Section 4.

2. Very Deep Two-stream ConvNets 

In this section, we give a detailed description of our
proposed method. We first introduce the architectures of very
deep two-stream ConvNets. After that, we present the
training details, which are very important for reducing the
effect of over-fitting. Finally, we describe our testing
strategies for action recognition.

2.1. Network architectures 

Network architectures are of great importance in the
design of deep ConvNets. In the past several years, many
well-known network structures have been proposed for image
classification, such as AlexNet [7], ClarifaiNet [22],
GoogLeNet [15], VGGNet [13], and so on. Some trends
emerge in the evolution from AlexNet to VGGNet: smaller
convolutional kernel sizes, smaller convolutional strides, and
deeper network architectures. These trends have turned out
to be effective at improving object recognition performance.
However, their influence on action recognition in the video
domain has not been fully investigated. Here, we choose two
recent successful network structures to design very deep
two-stream ConvNets, namely GoogLeNet and VGGNet.

GoogLeNet. It is essentially a deep convolutional network
architecture codenamed Inception, whose design is motivated
by the Hebbian principle and the intuition of multi-scale
processing. An important component of the Inception network
is the Inception module, which is composed of multiple
convolutional filters of different sizes placed alongside each
other. To improve computational efficiency, 1 × 1
convolutions are used for dimension reduction. GoogLeNet is
a 22-layer network consisting of Inception modules stacked
upon each other, with occasional max-pooling layers with
stride 2 to halve the resolution of the grid. More details can
be found in the original paper [15].

VGGNet. It is a convolutional architecture with smaller
convolutional kernels (3 × 3), a smaller convolutional stride
(1 × 1), smaller pooling windows (2 × 2), and a deeper
structure (up to 19 layers). VGGNet systematically
investigates the influence of network depth on recognition
performance by building and pre-training deeper
architectures based on shallower ones. Finally, two successful
network structures are proposed for the ImageNet challenge:
VGG-16 (13 convolutional layers and 3 fully-connected
layers) and VGG-19 (16 convolutional layers and 3
fully-connected layers). More details can be found in the
original paper [13].

Very Deep Two-stream ConvNets. Following these
successful architectures in object recognition, we adapt
them to the design of two-stream ConvNets for action
recognition in videos, which we call very deep two-stream
ConvNets. We empirically study both GoogLeNet and VGG-16
for the design of very deep two-stream ConvNets. The spatial
net is built on a single frame image (224 × 224 × 3), and
therefore its architecture is the same as that used for object
recognition in the image domain. The input of the temporal
net is a stack of 10 frames of optical flow fields
(224 × 224 × 20), and thus the convolutional filters in the
first layer differ from those of image classification models.
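The two input formats can be made concrete with a short NumPy sketch. The helper name `stack_flow` and the interleaved channel order (x1, y1, x2, y2, ...) are assumptions made for illustration; the text above only specifies the resulting sizes.

```python
import numpy as np

def stack_flow(flow_x, flow_y):
    """Stack L horizontal and L vertical flow maps into one (H, W, 2L) array.

    flow_x, flow_y: arrays of shape (L, H, W).
    The interleaved channel order is an assumption, not taken from the report.
    """
    num_frames, height, width = flow_x.shape
    stacked = np.empty((height, width, 2 * num_frames), dtype=flow_x.dtype)
    stacked[:, :, 0::2] = np.transpose(flow_x, (1, 2, 0))  # x components
    stacked[:, :, 1::2] = np.transpose(flow_y, (1, 2, 0))  # y components
    return stacked

# Spatial net input: a single RGB frame.
rgb_frame = np.zeros((224, 224, 3), dtype=np.uint8)

# Temporal net input: 10 flow frames, each with an x and a y component.
flow_x = np.zeros((10, 224, 224), dtype=np.uint8)
flow_y = np.ones((10, 224, 224), dtype=np.uint8)
temporal_input = stack_flow(flow_x, flow_y)
```

With 10 flow frames this yields 20 input channels, which is why the first convolutional layer of the temporal net cannot reuse the 3-channel image filters directly.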

2.2. Network training 

Here we describe how to train very deep two-stream
ConvNets on the UCF101 dataset. The UCF101 dataset
contains 13,320 video clips and provides 3 splits for
evaluation. For each split, there are around 10,000 clips for
training and 3,300 clips for testing. As the training dataset is
extremely small and the concept of action is relatively
complex, training very deep two-stream ConvNets is quite
challenging. From our empirical explorations, we have
identified several good practices for training very deep
two-stream ConvNets, as follows.

Pre-training for Two-stream ConvNets. Pre-training
has turned out to be an effective way to initialize deep
ConvNets when there are not enough training samples
available. For spatial nets, as in [12], we choose the ImageNet
models as the initialization for network training. For temporal
nets, the input modality is optical flow fields, which capture
motion information and differ from static RGB images.
Interestingly, we observe that pre-training temporal nets
with the ImageNet models still works well. To make this
pre-training reasonable, we make several modifications to the
optical flow fields and the ImageNet models. First, we extract
optical flow fields for each video and discretize them into the
interval [0, 255] by a linear transformation. Second, as the
number of input channels for temporal nets differs from that
of spatial nets (20 vs. 3), we average the first-layer filters of
the ImageNet model across the channel dimension and then
copy the averaged result 20 times as the initialization of the
temporal nets.
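The two modifications can be sketched in NumPy as follows. This is a minimal illustration, not the released implementation: the function names are hypothetical, the flow bound of 20 pixels used for the linear mapping is an assumption (the report does not state the range), and the filter layout (out_channels, in_channels, k, k) is assumed.

```python
import numpy as np

def discretize_flow(flow, bound=20.0):
    """Linearly map flow values from [-bound, bound] to integers in [0, 255].

    `bound` is an assumed clipping range; the report gives no value.
    """
    clipped = np.clip(flow, -bound, bound)
    scaled = (clipped + bound) * (255.0 / (2.0 * bound))
    return np.round(scaled).astype(np.uint8)

def init_temporal_conv1(rgb_filters, num_flow_channels=20):
    """Average first-layer filters over the RGB channels and replicate them.

    rgb_filters: (out_channels, 3, k, k) weights from an ImageNet model.
    Returns an array of shape (out_channels, num_flow_channels, k, k).
    """
    mean_filter = rgb_filters.mean(axis=1, keepdims=True)
    return np.repeat(mean_filter, num_flow_channels, axis=1)

# Toy example with random stand-in weights (64 filters of size 3 x 3).
rng = np.random.default_rng(0)
rgb_filters = rng.standard_normal((64, 3, 3, 3))
temporal_filters = init_temporal_conv1(rgb_filters)
```

Every one of the 20 input channels starts from the same averaged filter, so the temporal net begins from the spatial structure learned on ImageNet and specializes to motion during fine-tuning.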

Smaller Learning Rate. As we pre-train the two-stream
ConvNets with ImageNet models, we use a smaller learning
rate than the original training in [12]. Specifically, we set
the learning rate as follows:

• For temporal nets, the learning rate starts at 0.005,
decreases to 1/10 of its value every 10,000 iterations,
and training stops at 30,000 iterations.

• For spatial nets, the learning rate starts at 0.001,
decreases to 1/10 of its value every 4,000 iterations,
and training stops at 10,000 iterations.

In total, the learning rate is decreased 3 times. At the same
time, we notice that training very deep two-stream ConvNets
requires fewer iterations. We conjecture that this is because
we pre-trained the networks with the ImageNet models.
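The step schedules above can be written out explicitly. The sketch below assumes the usual step rule, base_lr × 0.1^floor(iteration / step_size), which matches the "step" learning-rate policy in Caffe; the function names are illustrative only.

```python
def step_learning_rate(iteration, base_lr, step_size, gamma=0.1):
    """Learning rate after multiplying by `gamma` every `step_size` iterations."""
    return base_lr * gamma ** (iteration // step_size)

def schedule(base_lr, step_size, max_iter):
    """Return (start_iteration, learning_rate) for each constant segment."""
    return [(start, step_learning_rate(start, base_lr, step_size))
            for start in range(0, max_iter, step_size)]

# Values taken from the two bullet points above.
temporal_schedule = schedule(base_lr=0.005, step_size=10000, max_iter=30000)
spatial_schedule = schedule(base_lr=0.001, step_size=4000, max_iter=10000)

for name, segments in (("temporal", temporal_schedule),
                       ("spatial", spatial_schedule)):
    for start, lr in segments:
        print(f"{name}: from iteration {start}, lr = {lr:.6f}")
```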

More Data Augmentation Techniques. It has been
demonstrated that data augmentation techniques such as
random cropping and horizontal flipping are very effective
for avoiding over-fitting. Here, we try two new data
augmentation techniques for training very deep two-stream
ConvNets, as follows:

• We design a corner cropping strategy, which means we
only crop the 4 corners and the center of the images. We
find that with random cropping, regions close to the
image center are more likely to be selected and the
training loss goes down quickly, leading to over-fitting.
However, if we explicitly constrain the cropping to the
4 corners and the center, the variation of the network
input increases, which helps to reduce the effect of
over-fitting.

• We use a multi-scale cropping method for training very
deep two-stream ConvNets. Multi-scale representations
have turned out to be effective for improving the
performance of object recognition on the ImageNet
dataset [13]. Here, we adapt this good practice to the
task of action recognition, but with a more efficient
implementation than that used in object recognition
[13]. We fix the input image size at 256 × 340 and
randomly sample the cropping width and height from
{256, 224, 192, 168}. After that, we resize the cropped
regions to 224 × 224. It is worth noting that this
cropping strategy introduces not only multi-scale
augmentation but also aspect-ratio augmentation.
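The two augmentation steps can be combined as in the following sketch. It assumes that crop width and height are sampled independently (which is what produces aspect-ratio variation) and uses a nearest-neighbour resize as a stand-in for a real interpolation routine; all function names are hypothetical.

```python
import random
import numpy as np

CROP_SIZES = (256, 224, 192, 168)

def corner_offsets(image_h, image_w, crop_h, crop_w):
    """Top-left (y, x) offsets of the 4 corner crops and the center crop."""
    max_y, max_x = image_h - crop_h, image_w - crop_w
    return [(0, 0), (0, max_x), (max_y, 0), (max_y, max_x),
            (max_y // 2, max_x // 2)]

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize; a placeholder for proper interpolation."""
    rows = np.arange(out_h) * image.shape[0] // out_h
    cols = np.arange(out_w) * image.shape[1] // out_w
    return image[rows][:, cols]

def multiscale_corner_crop(image, out_size=224, rng=random):
    """Sample a crop size, crop at a corner or the center, then resize."""
    crop_h = rng.choice(CROP_SIZES)  # height and width sampled independently
    crop_w = rng.choice(CROP_SIZES)  # (an assumption about the procedure)
    y, x = rng.choice(corner_offsets(image.shape[0], image.shape[1],
                                     crop_h, crop_w))
    patch = image[y:y + crop_h, x:x + crop_w]
    return resize_nearest(patch, out_size, out_size)

frame = np.zeros((256, 340, 3), dtype=np.uint8)  # fixed 256 x 340 input
augmented = multiscale_corner_crop(frame, rng=random.Random(0))
```

Because only five positions are allowed per crop size, no training crop is dominated by the image center in the way uniformly random offsets tend to be.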

High Dropout Ratio. Similar to the original two-stream
ConvNets [12], we also set a high dropout ratio for the fully
connected layers in the very deep two-stream ConvNets. In
particular, we set dropout ratios of 0.9 and 0.8 for the fully
connected layers of temporal nets. For spatial nets, we set
dropout ratios of 0.9 and 0.9 for the fully connected layers.
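To illustrate what such a high ratio means in practice, the sketch below applies dropout to a vector of activations. It uses the common "inverted dropout" convention, in which surviving units are rescaled by 1/(1 - ratio) at training time; that convention, and the reading that the two ratios belong to the first and second fully connected layers respectively, are assumptions.

```python
import numpy as np

# Ratios from the text; the (first fc, second fc) ordering is assumed.
DROPOUT_RATIOS = {"temporal": (0.9, 0.8), "spatial": (0.9, 0.9)}

def dropout(activations, ratio, rng, train=True):
    """Zero each unit with probability `ratio` and rescale the survivors."""
    if not train:
        return activations  # dropout is disabled at test time
    keep_prob = 1.0 - ratio
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
fc_output = np.ones(100000)
dropped = dropout(fc_output, DROPOUT_RATIOS["temporal"][0], rng)
zero_fraction = float(np.mean(dropped == 0.0))
```

With a ratio of 0.9, only about one unit in ten is active in any training iteration, which acts as a strong regularizer on the fully connected layers.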

Multi-GPU training. One great obstacle to applying
deep learning models to video action recognition is the
prohibitively long training time. In addition, taking multiple
frames as input increases the memory consumed for storing
layer activations. We address these problems by employing
data-parallel training on multiple GPUs. The training system
is implemented with Caffe [4] and OpenMPI. Following a
technique similar to that used in [3], we avoid synchronizing
the parameters of the fully connected (fc) layers by gathering
the activations from all worker processes before running
the fc layers. With 4 GPUs, training is 3.7x faster for
VGGNet-16 and 4.0x faster for GoogLeNet, and it takes 4x
less memory per GPU. The system is publicly available¹.
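The data flow of the activation-gathering idea can be simulated in a single process. The sketch below is not the released Caffe/OpenMPI code: workers are simulated by list entries, the convolutional part is replaced by one matrix product with a ReLU, and all names and sizes are hypothetical.

```python
import numpy as np

def conv_part(batch, conv_weights):
    """Stand-in for the convolutional layers; every worker holds a replica."""
    return np.maximum(batch @ conv_weights, 0.0)

def forward_with_gather(shards, conv_weights, fc_weights):
    """Simulate one forward pass with per-worker conv and a shared fc layer."""
    # Each worker runs the convolutional part on its slice of the mini-batch.
    local_activations = [conv_part(shard, conv_weights) for shard in shards]
    # Gathering the activations lets a single copy of the fc weights process
    # the whole mini-batch, so the fc parameters need no synchronization.
    gathered = np.concatenate(local_activations, axis=0)
    return gathered @ fc_weights

rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 16))         # 8 samples, 16 features (toy)
conv_weights = rng.standard_normal((16, 32))
fc_weights = rng.standard_normal((32, 101))  # 101 action classes
shards = np.split(batch, 4)                  # 4 simulated workers
scores = forward_with_gather(shards, conv_weights, fc_weights)
```

The gathered forward pass produces the same scores as processing the full mini-batch on one device, while the large fc weight matrices are kept in only one place.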

2.3. Network testing 

For a fair comparison with the original two-stream
ConvNets [12], we follow their testing scheme for action
recognition. At test time, we sample 25 frame images or
optical flow fields for testing the spatial and temporal nets,
respectively. From each of these selected frames, we obtain
10 inputs for the very deep two-stream ConvNets, i.e., the 4
corners, the center, and their horizontal flips. The final
prediction score is obtained by averaging across the sampled
frames and their cropped regions. For the fusion of spatial
and temporal nets, we use a weighted linear combination
of their prediction scores, where the weight is set to 2 for
the temporal net and 1 for the spatial net.
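The testing scheme can be summarized in a few lines of NumPy. In this sketch `net` is a hypothetical callable that maps one 224 × 224 crop to a vector of class scores, and the 256 × 340 frame size is assumed to match the training input; neither detail is specified in this section.

```python
import numpy as np

def ten_crops(frame, crop=224):
    """4 corner crops and 1 center crop of a (H, W, C) frame, plus flips."""
    h, w = frame.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    crops = [frame[y:y + crop, x:x + crop] for y, x in offsets]
    return crops + [c[:, ::-1] for c in crops]  # horizontal flips

def video_score(frames, net):
    """Average class scores over all sampled frames and their 10 crops."""
    scores = [net(c) for frame in frames for c in ten_crops(frame)]
    return np.mean(scores, axis=0)

def fuse(spatial_score, temporal_score, w_spatial=1.0, w_temporal=2.0):
    """Weighted linear combination with weights 1 (spatial), 2 (temporal)."""
    return w_spatial * spatial_score + w_temporal * temporal_score

# Toy usage with 25 sampled frames and a dummy two-class network.
frames = [np.zeros((256, 340, 3)) for _ in range(25)]
dummy_net = lambda crop: np.array([0.2, 0.8])
spatial = video_score(frames, dummy_net)
fused = fuse(np.array([0.6, 0.4]), np.array([0.1, 0.9]))
predicted_class = int(np.argmax(fused))
```

Each network therefore evaluates 25 × 10 = 250 inputs per video before the two streams are fused.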

3. Experiments 

Datasets and Implementation Details. To verify the
effectiveness of the proposed very deep two-stream ConvNets,
we conduct experiments on the UCF101 [14] dataset.
The UCF101 dataset contains 101 action classes, with at
least 100 video clips for each class. The whole dataset
contains 13,320 video clips, which are divided into 25 groups
for each action category. We follow the evaluation scheme of
the THUMOS13 challenge [5] and adopt the three
training/testing splits for evaluation. We report the average
recognition accuracy across classes over these three splits.
For the extraction of optical flow fields, we follow the work
on TDD [19] and choose the TVL1 optical flow algorithm
[21]. Specifically, we use the OpenCV implementation, due to
its balance between accuracy and efficiency.

Results. We report the action recognition performance
in Table 1. We compare three different network
architectures, namely ClarifaiNet, GoogLeNet, and
VGGNet-16. From these results, we see that the deeper
architectures obtain better performance and that VGGNet-16
achieves the best performance. For spatial nets, VGGNet-16
outperforms the shallow network by around 5%, and for
temporal nets, VGGNet-16 is better by around 4%. Very deep
two-stream ConvNets outperform the original two-stream
ConvNets by 3.4%.

It is worth noting that in our previous experience [20] in
the THUMOS15 Action Recognition Challenge [2], we tried
very deep two-stream ConvNets, but temporal nets with
deeper structures did not yield good performance, as shown
in Table 2. In that THUMOS15 submission, we trained the
very deep two-stream ConvNets in the same way as the
original two-stream ConvNets [12], without using the good
practices proposed here. From the different performance of
very deep two-stream ConvNets on the two datasets, we
conjecture that our proposed good practices are very
effective at reducing the effect of over-fitting, owing to
(a) pre-training temporal nets with the ImageNet models and
(b) using more data augmentation techniques.

¹ https://github.com/yjxiong/caffe/tree/action_recog

| Architecture | Spatial Split 1 | Spatial Split 2 | Spatial Split 3 | Spatial Average | Temporal Split 1 | Temporal Split 2 | Temporal Split 3 | Temporal Average | Two-stream Split 1 | Two-stream Split 2 | Two-stream Split 3 | Two-stream Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ClarifaiNet (from [12]) | 72.7% | - | - | 73.0% | 81.0% | - | - | 83.7% | 87.0% | - | - | 88.0% |
| GoogLeNet | 77.1% | 73.2% | 75.6% | 75.3% | 83.9% | 86.5% | 86.9% | 85.8% | 89.0% | 89.3% | 89.5% | 89.3% |
| VGGNet-16 | 79.8% | 77.3% | 77.8% | 78.4% | 85.7% | 88.2% | 87.4% | 87.0% | 90.9% | 91.6% | 91.6% | 91.4% |

Table 1. Performance comparison of different architectures on the UCF101 dataset (using our proposed good practices).

| Stream | Architecture | Accuracy |
|---|---|---|
| Spatial nets | ClarifaiNet | 42.3% |
| Spatial nets | GoogLeNet | 53.7% |
| Spatial nets | VGGNet-16 | 54.5% |
| Temporal nets | ClarifaiNet | 47.0% |
| Temporal nets | GoogLeNet | 39.9% |
| Temporal nets | VGGNet-11 | 42.6% |

Table 2. Performance comparison of different architectures on the THUMOS15 [2] validation dataset (from [20], without using our proposed good practices).

| Method | Year | Accuracy |
|---|---|---|
| iDT+FV [16] | 2013 | 85.9% |
| iDT+HSV [10] | 2014 | 87.9% |
| MIFS+FV [8] | 2015 | 89.1% |
| TDD+FV [19] | 2015 | 90.3% |
| DeepNet [6] | 2014 | 63.3% |
| Two-stream [12] | 2014 | 88.0% |
| Two-stream+LSTM [9] | 2015 | 88.6% |
| Very deep two-stream | 2015 | 91.4% |

Table 3. Performance comparison with the state of the art on the UCF101 dataset.

Comparison. Finally, we compare our recognition
accuracy with several recent methods; the results are shown
in Table 3. We first compare with Fisher vector
representations of hand-crafted features such as Improved
Trajectories (iDT) [16] and of deep-learned features such as
Trajectory-Pooled Deep-Convolutional Descriptors (TDD) [19].
Our result is better than all of these Fisher vector
representations. Second, we compare the very deep two-stream
ConvNets with other deep networks such as DeepNet [6]
and two-stream ConvNets with recurrent neural networks
[9]. We see that our proposed very deep models outperform
previous ones and are better than the best result by 2.8%.

4. Conclusions 

In this work, we have evaluated very deep two-stream
ConvNets for action recognition. Because action recognition
datasets are extremely small, we proposed several good
practices for training very deep two-stream ConvNets. With
our carefully designed training strategies, the proposed very
deep two-stream ConvNets achieved a recognition accuracy
of 91.4% on the UCF101 dataset. Meanwhile, we extended
the Caffe toolbox into a multi-GPU implementation with
high efficiency and low memory consumption.